Data Visualization: ggplot2 tutorial using gapminder dataset
If I can’t picture it, I can’t understand it. Albert Einstein
Overview
In this tutorial we look at some of the data on wealth and life
expectancy of countries over time used by Hans Rosling, known as
gapminder. The goal is to provide an overview of how to
graph a variable depending on its type, introduce some simple 1D
and 2D plots constructed using ggplot2, and provide an
outline of the layered grammar of graphics upon which
ggplot2 is built.
Learning objectives
- Generate plots from data according to their type (discrete, continuous …)
- Manage plot settings
- Produce plots from data in a data frame
- Modify and customize a plot
- Create complex and fancy plots
Loading/installing packages
library(ggplot2) #for plotting
library(dplyr) #for data manipulation
library(scales) #for graphical scales
library(gapminder) #for dataset
library(plotly) # adds a frame aesthetic to ggplot, and allows interactive, linked views of a series of frames over time
Let’s have a look at our data structure.
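The call that produced the structure listing is not shown in the text. It is presumably base R's str() applied to the gapminder data frame, as in this minimal sketch:

```r
library(gapminder)  # provides the gapminder data frame

# Compact display of the structure: one line per column,
# with its type and the first few values
str(gapminder)
```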
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
The print() method gives an abbreviated printout.
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
It is useful to get some overview of the variables before getting started.
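The overview has the shape of summary() output. The call itself is not shown, so the following is a presumed reconstruction:

```r
library(gapminder)

# Per-column summaries: level counts for factors,
# five-number summary plus mean for numeric columns
summary(gapminder)
```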
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
We will want to look at trends over time by continent.
How many countries are in this data set in each continent? There are 12
years for each country. Are the data complete? table()
gives an answer.
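A sketch of the call that gives this cross-tabulation, assuming it was built from the continent and year columns (the name counts is only illustrative):

```r
library(gapminder)

# Cross-tabulate continent by year: each cell is the number of
# countries observed in that continent in that year
counts <- table(gapminder$continent, gapminder$year)
counts
```

If every row shows the same count in all 12 columns, no country is missing a year.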
##
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Africa 52 52 52 52 52 52 52 52 52 52 52 52
## Americas 25 25 25 25 25 25 25 25 25 25 25 25
## Asia 33 33 33 33 33 33 33 33 33 33 33 33
## Europe 30 30 30 30 30 30 30 30 30 30 30 30
## Oceania 2 2 2 2 2 2 2 2 2 2 2 2
Note: we used the $ symbol with
data$variable notation because table() doesn’t
have a data= argument. Another way to do this is to use the
with() function, which makes the variables in a data set
available directly. The same table can be obtained using:
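The with() version of the same table would look something like this (the name counts2 is only illustrative). Because the variables are passed by name, the dimensions of the result are labelled continent and year:

```r
library(gapminder)

# with() evaluates the expression using the columns of gapminder,
# so continent and year can be referred to directly
counts2 <- with(gapminder, table(continent, year))
counts2
```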
## year
## continent 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## Africa 52 52 52 52 52 52 52 52 52 52 52 52
## Americas 25 25 25 25 25 25 25 25 25 25 25 25
## Asia 33 33 33 33 33 33 33 33 33 33 33 33
## Europe 30 30 30 30 30 30 30 30 30 30 30 30
## Oceania 2 2 2 2 2 2 2 2 2 2 2 2
1D plots: Bar plots for discrete variables
As we have seen previously during the lecture, the distribution of a
categorical variable, for example continent, is best visualized using a
bar plot. With ggplot2, this is relatively easy:
- we start by mapping the x variable to continent
- then, we add a geom_bar() layer, which counts the observations in each category and plots them as bar lengths.
To make this more colorful, you can also map the fill
attribute to continent.
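The code for these two plots is not shown in the text. A minimal sketch of what the steps above describe (the object names bar_basic and bar_filled are hypothetical):

```r
library(ggplot2)
library(gapminder)

# Step 1: map x to continent; geom_bar() counts the rows in each category
bar_basic <- ggplot(gapminder, aes(x = continent)) +
  geom_bar()
bar_basic

# Step 2: additionally map the fill colour to continent
bar_filled <- ggplot(gapminder, aes(x = continent, fill = continent)) +
  geom_bar()
bar_filled
```

Note that the bar heights here are row counts (country-year combinations), not numbers of countries.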
With ggplot2 features, we can also:
- change the default color schemes
- modify labels
- change the legend position, or remove the legend in some cases
- flip the axes …
Let’s try some!
- We will change the y axis, count, in geom_bar() to after_stat(count)/12 (written ..count../12 in older ggplot2 code) in order to represent the number of countries.
- Change the label of the y axis to a more meaningful one: countries
- Suppress the default legend for continent, which is redundant in this case
ggplot(gapminder, aes(x=continent, fill=continent)) +
  geom_bar(aes(y=after_stat(count)/12)) + # 12 rows (years) per country, so count/12 = number of countries
  labs(y="Number of countries") +
  guides(fill="none") # fill=FALSE is deprecated in recent ggplot2 versions
Note: Every plot in ggplot2 is a ggplot object.
If you want to save a given plot for future use, store it in a
variable by using: mybar <- ggplot() + ...
mybar <- ggplot(gapminder, aes(x=continent, fill=continent)) +
  geom_bar(aes(y=after_stat(count)/12)) + # after_stat(count) is the current spelling of ..count..
  labs(y="Number of countries") +
  guides(fill="none")
mybar
Some other ggplot2 features
- Transforming coordinates using the coord_trans function
- Flipping axes using the coord_flip function
- Transforming to polar coordinates
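A minimal sketch of two of these features, flipping and polar coordinates, applied to the bar plot saved above. The plot is rebuilt inside the snippet so that it runs on its own, and the names bar_flipped and bar_polar are hypothetical:

```r
library(ggplot2)
library(gapminder)

# Same bar plot as above, rebuilt here so the snippet is self-contained
mybar <- ggplot(gapminder, aes(x = continent, fill = continent)) +
  geom_bar(aes(y = after_stat(count) / 12)) +
  labs(y = "Number of countries") +
  guides(fill = "none")

# Swap the x and y axes: the bars become horizontal
bar_flipped <- mybar + coord_flip()
bar_flipped

# Polar coordinates: x becomes the angle, y the radius
bar_polar <- mybar + coord_polar()
bar_polar
```

Because a coordinate system is just another component added with +, the saved mybar object itself is left unchanged.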
1D plots: density plots for continuous variables
The gapminder data set contains several continuous
variables: life expectancy (lifeExp), population
(pop) and gross domestic product per capita
(gdpPercap) for each year and country. For such variables,
density plots provide a useful graphical summary.
Let’s start by exploring life expectancy. The simplest plot uses this
as the horizontal axis, aes(x=lifeExp) and then adds
geom_density() to calculate and plot the smoothed frequency
distribution.
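The plot code is not shown here. Following the description, the simplest version would be something like this (dens is a hypothetical name):

```r
library(ggplot2)
library(gapminder)

# Map life expectancy to x; geom_density() computes and draws
# a kernel density estimate of its distribution
dens <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_density()
dens
```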
Several options make this plot prettier: change the
line thickness (size=, or linewidth= from ggplot2 3.4.0 onward), add a fill color
(fill=""), and make the fill color partially transparent
(alpha=).
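As an illustration of these three settings, here is a sketch with arbitrarily chosen values. The thickness, the color and the transparency level are assumptions, not the values used for the original figure, and linewidth= assumes ggplot2 3.4.0 or later:

```r
library(ggplot2)
library(gapminder)

# Constant (non-mapped) settings go outside aes():
# a thicker outline, a pink fill, and 30% opacity for the fill
dens_pretty <- ggplot(gapminder, aes(x = lifeExp)) +
  geom_density(linewidth = 1.5, fill = "pink", alpha = 0.3)
dens_pretty
```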
Differences by continent
The distribution of lifeExp is bimodal, and the reason is not obvious from
this plot. We can add another aesthetic attribute, fill=continent, which is
inherited by geom_density(), to see how countries differ
across continents.
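A sketch of the grouped density plot, assuming fill=continent is set in the ggplot() call and a transparency of 0.3 (the exact alpha value and the name dens_cont are assumptions):

```r
library(ggplot2)
library(gapminder)

# fill = continent in ggplot() is inherited by geom_density(),
# which then draws one density curve per continent
dens_cont <- ggplot(gapminder, aes(x = lifeExp, fill = continent)) +
  geom_density(alpha = 0.3)
dens_cont
```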
Note 1: We used transparent colors
(alpha=) to see the different distributions
across continents more clearly.
Note 2: It is easy now to see that African countries differ markedly from the rest.
Boxplots and other visual summaries
You might want to visualize the distributions of life expectancy by
another visual summary, grouped by continent. All you need
to do is change the aesthetic to show continent on one
axis, and life expectancy (lifeExp) on the other.
Then, add a geom_boxplot() layer:
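The boxplot code is not shown in the text. A minimal sketch, assuming continent is also mapped to fill, which would explain the legend mentioned in the challenge that follows (box is a hypothetical name):

```r
library(ggplot2)
library(gapminder)

# continent on the x axis, life expectancy on the y axis;
# geom_boxplot() draws one box per continent
box <- ggplot(gapminder, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot()
box
```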
Challenge 1
- Remove the legend from this plot
- Make the plot horizontal
- Instead of a boxplot, try geom_violin()
Effect ordering
The continents are a factor, and are ordered alphabetically by default. It might be more useful to order them by the mean or median life expectancy.
In this example, I use the dplyr “pipe” notation
(%>%) to send the gapminder data to the
dplyr::mutate() function, and within that,
reorder() the continents by their median life
expectancy.
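The reordering step on its own would look something like this (gap_reordered is a hypothetical name). The printed rows look the same as before, because only the order of the factor levels changes:

```r
library(dplyr)
library(gapminder)

# reorder() sorts the levels of continent by the median of lifeExp
# within each level; the rows themselves are left untouched
gap_reordered <- gapminder %>%
  mutate(continent = reorder(continent, lifeExp, FUN = median))
gap_reordered

# Inspect the new level order (lowest median first)
levels(gap_reordered$continent)
```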
## # A tibble: 1,704 × 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
## 7 Afghanistan Asia 1982 39.9 12881816 978.
## 8 Afghanistan Asia 1987 40.8 13867957 852.
## 9 Afghanistan Asia 1992 41.7 16317921 649.
## 10 Afghanistan Asia 1997 41.8 22227415 635.
## # ℹ 1,694 more rows
Note: In other situations, you could use
FUN=mean, FUN=sd, or FUN=max to
sort the levels by their means, standard deviations, maximums, or any
other function.
We can now pipe the result of this right into
ggplot:
gapminder %>%
mutate(continent = reorder(continent, lifeExp, FUN=median)) %>%
ggplot(aes(x=continent, y=lifeExp, fill=continent)) +
geom_boxplot(outlier.size=2)
Exploring GDP
Let’s look at the distribution of gdpPercap in a similar
way, starting with the unconditional distribution.
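A sketch of the unconditional density of gdpPercap, mirroring the first life expectancy plot (the code is not shown in the text, so this is a presumed reconstruction and dens_gdp is a hypothetical name):

```r
library(ggplot2)
library(gapminder)

# Density of GDP per capita over all countries and years;
# the distribution is strongly right-skewed on the raw scale
dens_gdp <- ggplot(gapminder, aes(x = gdpPercap)) +
  geom_density()
dens_gdp
```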
Challenge 2
- As we did for lifeExp, plot the distributions separately for each continent
- It is probably more useful to plot GDP on a log scale. Add another layer that transforms the x axis to log10(gdpPercap).
- Make boxplots of gdpPercap by continent.
- Do the same, but plot GDP on a log scale.
1.5D: Layers & Time series plots
Layers
Let’s explore how life expectancy changes with GDP per capita for a single country, for
example China. We can use geom_line to make a line
plot.
china <- ggplot(subset(gapminder, country =="China"), #subsetting data
  aes(x=gdpPercap, y=lifeExp))
china + geom_line()
We can use both geom_line and geom_point to
make a line plot with points at the data values.
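Combining the two geoms might look like this. The china object is rebuilt here so the snippet runs on its own, and china_lp is a hypothetical name:

```r
library(ggplot2)
library(gapminder)

# Base plot for China: data and aesthetics, no geoms yet
china <- ggplot(subset(gapminder, country == "China"),
                aes(x = gdpPercap, y = lifeExp))

# Two layers: the line first, then a point at each observation
china_lp <- china + geom_line() + geom_point()
china_lp
```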
Note: This brings up another important concept with ggplot2: layers. A given plot can have multiple layers of geometric objects, plotted one on top of the other.
If we make the lines and points different colors, we can see that points are placed on top of the lines, since they are in the second layer.
If we switch the order of geom_point() and
geom_line(), we’ll reverse the layers.
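To see the layering, one could give the two geoms different colors and then swap their order. The colors and the names points_on_top and lines_on_top below are arbitrary choices for illustration:

```r
library(ggplot2)
library(gapminder)

china <- ggplot(subset(gapminder, country == "China"),
                aes(x = gdpPercap, y = lifeExp))

# Line is layer 1, points are layer 2: points are drawn on top
points_on_top <- china + geom_line(color = "blue") + geom_point(color = "red")
points_on_top

# Reversed order: the line is now drawn over the points
lines_on_top <- china + geom_point(color = "red") + geom_line(color = "blue")
lines_on_top
```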
Note: aesthetics that are included in the call to
ggplot() (or added completely separately) are made to be the
defaults for all layers, but we can separately control the aesthetics
for each layer. For example, we could color the points by year:
With a rainbow:
china + geom_line() + geom_point(aes(color=year)) + scale_color_gradientn(colours = rainbow(5)) #with a rainbow
Coloring both points and lines:
china + geom_line() + geom_point() + aes(color=year) + scale_color_gradientn(colours = rainbow(5)) #both with rainbow shade
Challenge 3
- Make a plot of lifeExp vs gdpPercap for China and India, with both lines and points.
Time series plot
Let’s explore how life expectancy has changed over time. The simplest way
is to plot a line for each country over year. To do this,
we use the group aesthetic.
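The basic version of this plot, before any color is added, is not shown in the text. A minimal sketch (lines_by_country is a hypothetical name):

```r
library(ggplot2)
library(gapminder)

# group = country tells geom_line() to draw a separate line per country;
# without it, all observations would be joined into a single zig-zag line
lines_by_country <- ggplot(gapminder,
                           aes(x = year, y = lifeExp, group = country)) +
  geom_line()
lines_by_country
```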
Adding colors:
ggplot(gapminder, aes(x=year, y=lifeExp, group=country , color = continent)) + #adding color
geom_line()
Changing the color shade (making the lines semi-transparent):
ggplot(gapminder, aes(x=year, y=lifeExp, group=country , color = continent)) + #changing colors shade
geom_line(alpha = 0.5)
Plotting a summary
A better way to look at trends over time is to find the mean or median for
each year and continent and plot those.
gapminder %>%
group_by(continent, year) %>%
summarise(lifeExp=median(lifeExp)) %>% head() #median for each year and continent
## # A tibble: 6 × 3
## # Groups: continent [1]
## continent year lifeExp
## <fct> <int> <dbl>
## 1 Africa 1952 38.8
## 2 Africa 1957 40.6
## 3 Africa 1962 42.6
## 4 Africa 1967 44.7
## 5 Africa 1972 47.0
## 6 Africa 1977 49.3
One nice feature of the dplyr and tidyverse
framework is that you can pipe the result of such a summary directly to
ggplot():
gapminder %>% #piping to ggplot
group_by(continent, year) %>%
summarise(lifeExp=median(lifeExp)) %>%
ggplot(aes(x=year, y=lifeExp, color=continent)) +
geom_line(linewidth=1) + # linewidth= replaces size= for lines in ggplot2 >= 3.4.0
geom_point(size=1.5)
If you want to make several plots of such a summarized data set, save the result in a new object.
gapyear <- gapminder %>% #saving in a new dataset using assignment
group_by(continent, year) %>%
summarise(lifeExp=median(lifeExp))
Let’s play with our plot and make it fancier!
We can fit linear regression lines for each continent
instead of joining all the points:
ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) + #fitting linear regression lines for each continent
geom_point(size=1.5) +
geom_smooth(aes(fill=continent), method="lm")
We can also use a loess smooth rather than a linear
regression:
ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) + #using a loess smooth
geom_point(size=1.5) +
geom_smooth(aes(fill=continent), method="loess")
We can change the default legend placement by putting the legend inside the plot:
ggplot(gapyear, aes(x=year, y=lifeExp, color=continent)) + #loess smooth, legend inside the plot
geom_point(size=1.5) +
theme(
legend.position = c(0.99, 0.03), # ggplot2 >= 3.5.0 prefers legend.position = "inside" with legend.position.inside
legend.justification = c("right", "bottom") #placing the legend inside the plot
)+
geom_smooth(aes(fill=continent), method="loess")
2D: Scatterplots
Let’s explore the relationship between life expectancy and GDP with a scatterplot.
A basic scatterplot is set up by assigning two variables to the
x and y aesthetic attributes; then we can add
the points in another layer.
Or, color them by continent.
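The code that defines plt is not shown in the text, although the plots that follow build on it. A sketch of what it presumably looks like, with gdpPercap on the x axis (an assumption, consistent with the later use of scale_x_log10()). The names scatter_basic and scatter_cont are hypothetical:

```r
library(ggplot2)
library(gapminder)

# Base plot: data and x/y aesthetics only, no layers yet
plt <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))

# Add the points in a separate layer
scatter_basic <- plt + geom_point()
scatter_basic

# Same points, colored by continent
scatter_cont <- plt + geom_point(aes(color = continent))
scatter_cont
```

Keeping the base plot in plt makes it easy to try different layers on top of the same data and aesthetics.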
For a better look, we can also add a smoothed curve for all the data:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") #adding a smoothed curve for all the data
As we have seen earlier about GDP, this variable is better plotted on a log scale:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10() #plotting on a log scale
Customizing the plot
The last plot, on the log scale, has ugly labels; let’s try to adjust the scale:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10(labels=scales::comma) #adjusting scale
Moving the legend inside the plot:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10(labels=scales::comma) +
theme(legend.position = c(0.8, 0.2)) # putting the legend inside the plot
Changing the theme:
plt + geom_point(aes(color=continent)) +
geom_smooth(method="loess") +
scale_x_log10(labels=scales::comma) +
theme_bw() #changing the theme of the plot
Replacing the single loess smoothed curve with a separate regression line for each continent:
plt + geom_point(aes(color=continent)) +
geom_smooth(aes(fill= continent) , method="lm") +
scale_x_log10(labels=scales::comma) +
theme_bw() #smoothing by a regression line for each continent
Making a “bubble” plot, mapping the size of each point
to population (pop):
plt + geom_point(aes(size = pop, color=continent)) + #making a bubble plot by mapping the size of each point to population
geom_smooth(method="lm") +
scale_x_log10(labels=scales::comma) +
theme_bw()
Changing color shades:
plt + geom_point(aes(size = pop, color=continent), alpha = 0.5) + #changing colors shade
geom_smooth(method="lm") +
scale_x_log10(labels=scales::comma) +
theme_bw()
Let’s explore life expectancy by continent for a given year. To do that, we will need to filter our data.
gm_2007 <- subset(gapminder, year==2007) #filtering data by picking those of 2007
ggplot(gm_2007, aes(y=lifeExp, x=continent)) + geom_point()
ggplot(gm_2007, aes(y=lifeExp, x=continent)) +
geom_point(position=position_jitter(width=0.1, height=0)) #jittering the points horizontally to reduce overplotting
Advanced customized and fancy plot
Bubble plot
Exploring GDP per capita versus life expectancy in 2007, with text labels for the largest countries (selected by filtering the data).
ggplot(gm_2007) +
geom_point(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop),# add scatter points
alpha = 0.5) +
geom_text(aes(x = gdpPercap, y = lifeExp + 3, label = country), # add some text annotations for the very large countries
color = "grey50",
data = filter(gm_2007, pop > 1000000000 | country %in% c("Nigeria", "United States"))) +
scale_x_log10(limits = c(200, 60000)) + # log scale on the x axis with fixed limits
labs(title = "GDP versus life expectancy in 2007", # change labels
x = "GDP per capita (log scale)",
y = "Life expectancy",
size = "Population",
color = "Continent") +
scale_size(range = c(0.1, 10), # change the size scale
guide = "none") + # remove size legend
theme_classic() + # add a nicer theme
theme(legend.position = "top", # place legend at top and grey axis lines
axis.line = element_line(color = "grey85"),
axis.ticks = element_line(color = "grey85"))